Abstract



In this paper, we report recent improvements to the
exemplar-based learning approach for word sense
disambiguation that have achieved higher disambiguation
accuracy. By using a larger value of k, the number of
nearest neighbors to use for determining the class of a
test example, and through 10-fold cross validation to
automatically determine the best k, we have obtained
improved disambiguation accuracy on a large sense-tagged
corpus first used in (Ng and Lee, 1996). The accuracy
achieved by our improved exemplar-based classifier is
comparable to the accuracy on the same data set obtained
by the Naive-Bayes algorithm, which was reported in
(Mooney, 1996) to have the highest disambiguation
accuracy among seven state-of-the-art machine learning
algorithms.

1 Introduction

Much recent research on word sense disambiguation (WSD)
has adopted a corpus-based, learning approach. Many
different learning approaches have been used, including
neural networks (Leacock et al., 1993), probabilistic
algorithms (Bruce and Wiebe, 1994; Gale et al., 1992a;
Gale et al., 1995; Leacock et al., 1993; Yarowsky, 1992),
decision lists (Yarowsky, 1994), exemplar-based learning
algorithms (Cardie, 1993; Ng and Lee, 1996), etc.

In particular, Mooney (1996) evaluated seven
state-of-the-art machine learning algorithms on a common
data set for disambiguating six senses of the word
"line". The seven algorithms that he evaluated are: a
Naive-Bayes classifier (Duda and Hart, 1973), a
perceptron (Rosenblatt, 1958), a decision-tree learner
(Quinlan, 1992), a k nearest-neighbor classifier
(exemplar-based learner) (Cover and Hart, 1967),
logic-based DNF and CNF learners (Mooney, 1995), and a
decision-list learner (Rivest, 1987). His results
indicate that the simple Naive-Bayes algorithm gives the
highest accuracy on the "line" corpus tested. Past
research in machine learning has also reported that the
Naive-Bayes algorithm achieved good performance on other
machine learning tasks (Clark and Niblett, 1989; Kohavi,
1996). This is in spite of the conditional independence
assumption made by the Naive-Bayes algorithm, which may
be unjustified in the domains tested. Gale, Church and
Yarowsky (Gale et al., 1992a; Gale et al., 1995;
Yarowsky, 1992) have also successfully used the
Naive-Bayes algorithm (and several extensions and
variations) for word sense disambiguation.

On the other hand, our past work on WSD (Ng and Lee,
1996) used an exemplar-based (or nearest neighbor)
learning approach. Our WSD program, Lexas, extracts a set
of features, including part of speech and morphological
form, surrounding words, local collocations, and
verb-object syntactic relation from a sentence containing
the word to be disambiguated. These features from a
sentence form an example. Lexas then uses the
exemplar-based learning algorithm Pebls (Cost and
Salzberg, 1993) to find the sense (class) of the word to
be disambiguated.

In this paper, we report recent improvements to the
exemplar-based learning approach for WSD that have
achieved higher disambiguation accuracy. The
exemplar-based learning algorithm Pebls contains a number
of parameters that must be set before running the
algorithm. These parameters include the number of nearest
neighbors to use for determining the class of a test
example (i.e., k in a k nearest-neighbor classifier),
exemplar weights, feature weights, etc. We found that the
number k of nearest neighbors used has a considerable
impact on the accuracy of the induced exemplar-based
classifier. By using 10-fold cross validation (Kohavi and
John, 1995) on the training set to automatically
determine the best k to use, we have obtained improved
disambiguation accuracy on a large sense-tagged corpus
first used in (Ng and Lee, 1996). The accuracy achieved
by our improved exemplar-based classifier is comparable
to the accuracy on the same data set obtained by the
Naive-Bayes algorithm, which was reported in (Mooney,
1996) to have the highest disambiguation accuracy among
seven state-of-the-art machine learning algorithms.

The rest of this paper is organized as follows. Section 2
gives a brief description of the exemplar-based algorithm
Pebls and the Naive-Bayes algorithm. Section 3 describes
the 10-fold cross validation training procedure to
determine the best number k of nearest neighbors to use.
Section 4 presents the disambiguation accuracy of Pebls
and Naive-Bayes on the large corpus of (Ng and Lee,
1996). Section 5 discusses the implications of the
results. Section 6 gives the conclusion.

2 Learning Algorithms 

2.1 Pebls 

The heart of exemplar-based learning is a measure of the
similarity, or distance, between two examples. If the
distance between two examples is small, then the two
examples are similar. In Pebls (Cost and Salzberg, 1993),
the distance between two symbolic values $v_1$ and $v_2$
of a feature $f$ is defined as:

$$d(v_1, v_2) = \sum_{i=1}^{n} \left| P(C_i \mid v_1) - P(C_i \mid v_2) \right|$$

where $n$ is the total number of classes. $P(C_i \mid v_1)$
is estimated by $N_{1,i} / N_1$, where $N_{1,i}$ is the
number of training examples with value $v_1$ for feature
$f$ that are classified as class $i$ in the training
corpus, and $N_1$ is the number of training examples with
value $v_1$ for feature $f$ in any class.
$P(C_i \mid v_2)$ is estimated similarly. This distance
metric of Pebls is adapted from the value difference
metric of the earlier work of (Stanfill and Waltz, 1986).
The distance between two examples is the sum of the
distances between the values of all the features of the
two examples.
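The distance computation above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Pebls implementation: the function names (`value_class_counts`, `value_distance`, `example_distance`) are hypothetical, examples are assumed to be tuples of symbolic feature values, and a value never seen in training is assumed to have probability 0 for every class.

```python
from collections import Counter, defaultdict


def value_class_counts(examples, labels, feature):
    """For one feature, count how often each value occurs with each class."""
    counts = defaultdict(Counter)
    for example, label in zip(examples, labels):
        counts[example[feature]][label] += 1
    return counts


def value_distance(counts, v1, v2, classes):
    """Sum over all classes of |P(C_i | v1) - P(C_i | v2)|."""
    if v1 == v2:
        return 0.0
    c1 = counts.get(v1, Counter())
    c2 = counts.get(v2, Counter())
    n1, n2 = sum(c1.values()), sum(c2.values())
    total = 0.0
    for c in classes:
        # Assumption: an unseen value gets probability 0 for every class.
        p1 = c1[c] / n1 if n1 else 0.0
        p2 = c2[c] / n2 if n2 else 0.0
        total += abs(p1 - p2)
    return total


def example_distance(tables, x, y, classes):
    """Distance between two examples: sum of per-feature value distances."""
    return sum(
        value_distance(tables[f], x[f], y[f], classes) for f in range(len(x))
    )


if __name__ == "__main__":
    # Toy data: one feature, two senses.
    examples = [("a",), ("a",), ("b",), ("b",), ("c",), ("c",)]
    labels = ["s1", "s1", "s1", "s2", "s2", "s2"]
    tables = [value_class_counts(examples, labels, 0)]
    print(example_distance(tables, ("a",), ("c",), ["s1", "s2"]))
```

In the toy data, value "a" occurs only with sense s1 and value "c" only with sense s2, so the two values are as far apart as the metric allows, while "b" (seen once with each sense) lies halfway between them.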

Let k be the number of nearest neighbors to use for
determining the class of a test example, k ≥ 1. During
testing, a test example is compared against all the
training examples. Pebls then determines the k training
examples with the shortest distance to the test example.
The class to which the majority of these k closest
matching training examples belong is assigned as the
class of the test example, with ties among multiple
majority classes broken randomly.
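As a rough illustration of this voting step, the following Python sketch classifies a test example from its k nearest training examples. The names `knn_classify` and `overlap_distance` are hypothetical. The distance function is passed in as a parameter, and the simple mismatch count used here is only a stand-in for the Pebls metric. How Pebls orders training examples that are equally distant is not stated in the text; the sketch simply keeps training order.

```python
import random
from collections import Counter


def overlap_distance(x, y):
    """Stand-in metric (assumption): number of features whose values differ."""
    return sum(a != b for a, b in zip(x, y))


def knn_classify(train, test_example, k, distance, rng=random):
    """Return the majority class among the k nearest training examples.

    `train` is a list of (example, label) pairs. Ties among multiple
    majority classes are broken randomly.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    # sorted() is stable, so equally distant examples keep training order.
    nearest = sorted(train, key=lambda pair: distance(pair[0], test_example))[:k]
    votes = Counter(label for _, label in nearest)
    top = max(votes.values())
    tied = sorted(label for label, count in votes.items() if count == top)
    return rng.choice(tied)


if __name__ == "__main__":
    train = [(("a", "x"), "s1"), (("a", "y"), "s1"), (("b", "y"), "s2")]
    print(knn_classify(train, ("a", "x"), 1, overlap_distance))
    print(knn_classify(train, ("b", "y"), 3, overlap_distance))
```

With k = 3 the whole toy training set votes, so the second call returns the overall majority sense s1 even though the single nearest neighbor of ("b", "y") is labeled s2.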



Note that the nearest neighbor algorithm tested 
in (Mooney, 1996) uses Hamming distance as the 
distance metric between two symbolic feature values. 
This is different from the above distance metric used 
in Pebls. 

2.2 Naive-Bayes 

Our presentation of the Naive-Bayes algorithm (Duda and
Hart, 1973) follows that of (Clark and Niblett, 1989).
This algorithm is based on Bayes' theorem:

$$P(C_i \mid \wedge v_j) = \frac{P(\wedge v_j \mid C_i)\, P(C_i)}{P(\wedge v_j)}, \qquad i = 1 \ldots n$$

where $P(C_i \mid \wedge v_j)$ is the probability that a
test example is of class $C_i$ given feature values
$v_j$. ($\wedge v_j$ denotes the conjunction of all
feature values in the test example.) The goal of a
Naive-Bayes classifier is to determine the class $C_i$
with the highest conditional probability
$P(C_i \mid \wedge v_j)$. Since the denominator
$P(\wedge v_j)$ of the above expression is constant for
all classes $C_i$, the problem reduces to finding the
class $C_i$ with the maximum value for the numerator.

The Naive-Bayes classifier assumes independence of
example features, so that

$$P(\wedge v_j \mid C_i) = \prod_j P(v_j \mid C_i)$$



During training, Naive-Bayes constructs the matrix
$P(v_j \mid C_i)$, and $P(C_i)$ is estimated from the
distribution of training examples among the classes. To
avoid one zero count of $P(v_j \mid C_i)$ nullifying the
effect of the other non-zero conditional probabilities in
the multiplication, we replace zero counts of
$P(v_j \mid C_i)$ by $P(C_i)/N$, where $N$ is the total
number of training examples. Other more complex smoothing
procedures (such as those used in (Gale et al., 1992a))
are also possible, although we have not experimented with
these other variations.

For the experimental results reported in this paper, we
used the implementation of the Naive-Bayes algorithm in
the Pebls program (Rachlin and Salzberg, 1993), which has
an option for training and testing using the Naive-Bayes
algorithm. We only changed the handling of zero
probability counts to the method just described.
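The classification rule and the zero-count replacement described in this section can be sketched as follows. This is a minimal illustration, not the code of the Pebls program: the function names and the data layout (examples as tuples of symbolic values) are assumptions, and a tie between classes is resolved in favor of the class seen first in training, which the text does not specify.

```python
from collections import Counter, defaultdict


def train_naive_bayes(examples, labels):
    """Collect the counts needed to estimate P(C_i) and P(v_j | C_i)."""
    class_counts = Counter(labels)
    # feature_counts[(j, value)][c]: class-c examples whose feature j is value.
    feature_counts = defaultdict(Counter)
    for example, label in zip(examples, labels):
        for j, value in enumerate(example):
            feature_counts[(j, value)][label] += 1
    return len(labels), class_counts, feature_counts


def classify_naive_bayes(model, example):
    """Return the class maximizing P(C_i) * prod_j P(v_j | C_i)."""
    n, class_counts, feature_counts = model
    best_class, best_score = None, -1.0
    for c, count_c in class_counts.items():
        prior = count_c / n
        score = prior
        for j, value in enumerate(example):
            count = feature_counts.get((j, value), Counter())[c]
            if count == 0:
                # Zero count: use P(C_i) / N in place of P(v_j | C_i).
                score *= prior / n
            else:
                score *= count / count_c
        if score > best_score:
            best_class, best_score = c, score
    return best_class


if __name__ == "__main__":
    examples = [("a", "x"), ("a", "y"), ("b", "y")]
    labels = ["s1", "s1", "s2"]
    model = train_naive_bayes(examples, labels)
    print(classify_naive_bayes(model, ("a", "x")))
    print(classify_naive_bayes(model, ("b", "y")))
```

With only a handful of features per example, multiplying the probabilities directly is numerically safe; for much longer feature vectors one would usually sum logarithms instead.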



3 Improvements to Exemplar-Based WSD

Pebls contains a number of parameters that must be set
before running the algorithm. These parameters include k
(the number of nearest neighbors to use for determining
the class of a test example), exemplar weights, feature
weights, etc. Each of these parameters has a default
value in Pebls, e.g., k = 1, no exemplar weighting, no
feature weighting, etc. We have used the default values
for all parameter settings in our previous work on
exemplar-based WSD reported in (Ng and Lee, 1996).
However, our preliminary investigation indicates that,
among the various learning parameters of Pebls, the
number k of nearest neighbors used has a considerable
impact on the accuracy of the induced exemplar-based
classifier.

Cross validation is a well-known technique that can be
used for estimating the expected error rate of a
classifier which has been trained on a particular data
set. For instance, the C4.5 program (Quinlan, 1993)
contains an option for running cross validation to
estimate the expected error rate of an induced rule set.
Cross validation has been proposed as a general technique
to automatically determine the parameter settings of a
given learning algorithm using a particular data set as
training data (Kohavi and John, 1995).

In m-fold cross validation, a training data set is
partitioned into m (approximately) equal-sized blocks,
and the learning algorithm is run m times. In each run,
one of the m blocks of training data is set aside as test
data (the holdout set) and the remaining m - 1 blocks are
used as training data. The average error rate of the m
runs is a good estimate of the error rate of the induced
classifier.

For a particular parameter setting, we can run m-fold
cross validation to determine the expected error rate of
that particular parameter setting. We can then choose an
optimal parameter setting that minimizes the expected
error rate. Kohavi and John (1995) reported the
effectiveness of such a technique in obtaining optimal
sets of parameter settings over a large number of machine
learning problems.

In our present study, we used 10-fold cross validation to
automatically determine the best k (number of nearest
neighbors) to use from the training data. To determine
the best k for disambiguating a word on a particular
training set, we run 10-fold cross validation using Pebls
21 times, each time with k = 1, 5, 10, 15, ..., 85, 90,
95, 100. We compute the error rate for each k, and choose
the value of k with the minimum error rate. Note that the
automatic determination of the best k through 10-fold
cross validation makes use of only the training set,
without looking at the test set at all.

4 Experimental Results

Mooney (1996) has reported that the Naive-Bayes algorithm
gives the best performance on disambiguating six senses
of the word "line", among seven state-of-the-art learning
algorithms tested. However, his comparative study is done
on only one word using a data set of 2,094 examples. In
our present study, we evaluated Pebls and Naive-Bayes on
a much larger corpus containing sense-tagged occurrences
of 121 nouns and 70 verbs. This corpus was first reported
in (Ng and Lee, 1996), and it contains about 192,800
sense-tagged word occurrences of 191 most frequently
occurring and ambiguous words of English (see footnote
1). These 191 words have been tagged with senses from
WordNet (Miller, 1990), an on-line, electronic dictionary
available publicly. For this set of 191 words, the
average number of senses per noun is 7.8, while the
average number of senses per verb is 12.0. The sentences
in this corpus were drawn from the combined corpus of the
1 million word Brown corpus and the 2.5 million word Wall
Street Journal (WSJ) corpus.

We tested both algorithms on two test sets from this
corpus. The first test set, named BC50, consists of 7,119
occurrences of the 191 words appearing in 50 text files
of the Brown corpus. The second test set, named WSJ6,
consists of 14,139 occurrences of the 191 words appearing
in 6 text files of the WSJ corpus. Both test sets are
identical to the ones reported in (Ng and Lee, 1996).

Since the primary aim of our present study is the
comparative evaluation of learning algorithms, not
feature representation, we have chosen, for simplicity,
to use local collocations as the only features in the
example representation. Local collocations have been
found to be the single most informative set of features
for WSD (Ng and Lee, 1996). That local collocation
knowledge provides important clues to WSD has also been
pointed out previously by Yarowsky (1993).

Let $w$ be the word to be disambiguated, and let
$l_2\ l_1\ w\ r_1\ r_2$ be the sentence fragment
containing $w$. In the present study, we used seven
features in the representation of an example, which are
the local collocations of the surrounding 4 words. These
seven features are: $l_2$-$l_1$, $l_1$-$r_1$,
$r_1$-$r_2$, $l_1$, $r_1$, $l_2$, and $r_2$. The first
three features are concatenations of two words (see
footnote 2).
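The seven collocation features can be extracted with a short helper, sketched below in Python. The function name `collocation_features`, the padding token used when the target word is near a sentence boundary, and the underscore used to join two words are all assumptions made for illustration; the text does not say how such details were handled.

```python
def collocation_features(tokens, index, pad="<none>"):
    """Return the seven local collocation features for tokens[index].

    Order: l2-l1, l1-r1, r1-r2, l1, r1, l2, r2.
    `pad` (an assumption) fills positions outside the sentence.
    """
    def word(offset):
        pos = index + offset
        return tokens[pos] if 0 <= pos < len(tokens) else pad

    l2, l1, r1, r2 = word(-2), word(-1), word(1), word(2)
    return (
        f"{l2}_{l1}",  # two-word collocation to the left
        f"{l1}_{r1}",  # one word on each side
        f"{r1}_{r2}",  # two-word collocation to the right
        l1,
        r1,
        l2,
        r2,
    )


if __name__ == "__main__":
    sentence = "he cast his line into the river".split()
    print(collocation_features(sentence, 3))
```

Each feature value is a single symbolic token, which is the form required by both the value distance metric of Pebls and the conditional probability table of Naive-Bayes.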

1 This corpus is available from the Linguistic Data
Consortium (LDC). Contact the LDC at
ldc@unagi.cis.upenn.edu for details.

2 The first five of these seven features were also used
in (Ng and Lee, 1996).

Algorithm              BC50     WSJ6
Sense 1                40.5%    44.8%
Most Frequent          47.1%    63.7%
Ng & Lee (1996)        54.0%    68.6%
Pebls (k = 1)          55.0%    70.2%
Pebls (k = 20)         58.5%    74.5%
Pebls (10-fold c.v.)   58.7%    75.2%
Naive-Bayes            58.2%    74.5%

Table 1: Experimental Results

The experimental results obtained are tabulated in Table
1. The first three rows of accuracy figures are those of
(Ng and Lee, 1996). The default strategy of picking the
most frequent sense has been advocated as the baseline
performance for evaluating WSD programs (Gale et al.,
1992b; Miller et al., 1994). There are two instantiations
of this strategy in our current evaluation. Since WordNet
orders its senses such that sense 1 is the most frequent
sense, one possibility is to always pick sense 1 as the
best sense assignment. This assignment method does not
even need to look at the training examples. We call this
method "Sense 1" in Table 1. Another assignment method is
to determine the most frequently occurring sense in the
training examples, and to assign this sense to all test
examples. We call this method "Most Frequent" in Table 1.

The accuracy figures of Lexas as reported in (Ng and Lee,
1996) are reproduced in the third row of Table 1. These
figures were obtained using all features including part
of speech and morphological form, surrounding words,
local collocations, and verb-object syntactic relation.
However, the feature value pruning method of (Ng and Lee,
1996) only selects surrounding words and local
collocations as feature values if they are indicative of
some sense class as measured by conditional probability
(see (Ng and Lee, 1996) for details).



The next three rows show the accuracy figures of Pebls
using the parameter setting of k = 1, k = 20, and 10-fold
cross validation for finding the best k, respectively.
The last row shows the accuracy figures of the
Naive-Bayes algorithm. Accuracy figures of the last four
rows are all based on only seven collocation features as
described earlier in this section. However, all possible
feature values (collocated words) are used, without
employing the feature value pruning method used in (Ng
and Lee, 1996).

Note that the accuracy figures of Pebls with k = 1 are
1.0% and 1.6% higher than the accuracy figures of (Ng and
Lee, 1996) in the third row, also with k = 1. The feature
value pruning method of (Ng and Lee, 1996) is intended to
keep only feature values deemed important for
classification. It seems that the pruning method has
filtered out some useful collocation values that improve
classification accuracy, such that this unfavorable
effect outweighs the additional set of features (part of
speech and morphological form, surrounding words, and
verb-object syntactic relation) used.

Our results indicate that although Naive-Bayes performs
better than Pebls with k = 1, Pebls with k = 20 achieves
comparable performance. Furthermore, Pebls with 10-fold
cross validation to select the best k yields results
slightly better than the Naive-Bayes algorithm.

5 Discussion 



To understand why larger values of k are needed, 
we examined the performance of Pebls when tested 
on the WSJ6 test set. During 10-fold cross valida- 
tion runs on the training set, for each of the 191 
words, we compared two error rates: the minimum 
expected error rate of Pebls using the best k, and 
the expected error rate of the most frequent clas- 
sifier. We found that for 13 words out of the 191 
words, the minimum expected error rate of Pebls 
using the best k is still higher than the expected 
error rate of the most frequent classifier. That is, 
for these 13 words, Pebls will produce, on average, 
lower accuracy than the most frequent classifier. 

Importantly, for 11 of these 13 words, the best k found
for Pebls is at least 85. This indicates that for a
training data set on which Pebls has trouble even
outperforming the most frequent classifier, it will tend
to use a large value for k. This is explainable since for
a large value of k, Pebls will tend towards the
performance of the most frequent classifier, as it will
find the k closest matching training examples and select
the majority class among this large number of k examples.
Note that in the extreme case when k equals the size of
the training set, Pebls will behave exactly like the most
frequent classifier.
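The comparison made in this section, between the cross-validated error rate of the best k and that of the most frequent classifier, can be sketched in Python as follows. This is an illustration under assumptions rather than the procedure actually run with Pebls: all function names are hypothetical, a simple mismatch count stands in for the Pebls distance metric, folds are formed by shuffling with a fixed seed, and when several values of k share the minimum error rate the smallest one is chosen.

```python
import random
from collections import Counter

# The 21 candidate values of k: 1, 5, 10, ..., 100.
K_GRID = [1] + list(range(5, 101, 5))


def overlap_distance(x, y):
    """Stand-in metric (assumption): number of differing feature values."""
    return sum(a != b for a, b in zip(x, y))


def knn_predict(train, x, k, rng):
    """Majority class among the k nearest examples, random tie-breaking."""
    nearest = sorted(train, key=lambda pair: overlap_distance(pair[0], x))[:k]
    votes = Counter(label for _, label in nearest)
    top = max(votes.values())
    return rng.choice(sorted(l for l, c in votes.items() if c == top))


def most_frequent_predict(train, x, k, rng):
    """Baseline: always predict the most frequent training class."""
    return Counter(label for _, label in train).most_common(1)[0][0]


def cv_error(data, predict, k, folds=10, seed=0):
    """Average error rate over `folds` runs; requires len(data) >= folds."""
    rng = random.Random(seed)
    shuffled = data[:]
    rng.shuffle(shuffled)
    rates = []
    for i in range(folds):
        held_out = shuffled[i::folds]
        train = [p for j, p in enumerate(shuffled) if j % folds != i]
        wrong = sum(predict(train, x, k, rng) != y for x, y in held_out)
        rates.append(wrong / len(held_out))
    return sum(rates) / folds


def select_k(data, candidates=K_GRID, folds=10, seed=0):
    """Return (best k, its cross-validated error rate)."""
    errors = {k: cv_error(data, knn_predict, k, folds, seed) for k in candidates}
    best = min(candidates, key=lambda k: errors[k])
    return best, errors[best]


if __name__ == "__main__":
    data = [(("a",), "s1")] * 14 + [(("b",), "s2")] * 6
    print(select_k(data))
    print(cv_error(data, most_frequent_predict, k=1))
    print(cv_error(data, knn_predict, k=100))
```

On the toy data, any k at least as large as the training portion of a fold makes every training example a neighbor, so the k nearest neighbor error rate coincides with that of the most frequent classifier, which is the extreme case noted above.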

Our results indicate that although Pebls with k = 1 gives
lower accuracy compared with Naive-Bayes, Pebls with
k = 20 performs as well as Naive-Bayes. Furthermore,
Pebls with automatically selected k using 10-fold cross
validation gives slightly higher performance compared
with Naive-Bayes. We believe that this result is
significant, in light of the fact that Naive-Bayes has
been found to give the best performance for WSD among
seven state-of-the-art machine learning algorithms
(Mooney, 1996). It demonstrates that an exemplar-based
learning approach is suitable for the WSD task, achieving
high disambiguation accuracy.

One potential drawback of an exemplar-based learning
approach is the testing time required, since each test
example must be compared with every training example, and
hence the required testing time grows linearly with the
size of the training set. However, more sophisticated
indexing methods such as that reported in (Friedman et
al., 1977) can reduce this to logarithmic expected time,
which will significantly reduce testing time.

In the present study, we have focused on the comparison
of learning algorithms, but not on feature representation
of examples. Our past work (Ng and Lee, 1996) suggests
that multiple sources of knowledge are indeed useful for
WSD. Future work will explore the addition of these other
features to further improve disambiguation accuracy.

Besides the parameter k, Pebls also contains 
other learning parameters such as exemplar weights 
and feature weights. Exemplar weighting has been 
found to improve classification performance (Cost 
and Salzberg, 1993). Also, given the relative impor- 
tance of the various knowledge sources as reported 
in (Ng and Lee, 1996), it may be possible to improve 
disambiguation performance by introducing feature 
weighting. Future work can explore the effect of ex- 
emplar weighting and feature weighting on disam- 
biguation accuracy. 

6 Conclusion 

In summary, we have presented improvements to the 
exemplar-based learning approach for WSD. By us- 
ing a larger value of k, the number of nearest neigh- 
bors to use for determining the class of a test ex- 
ample, and through 10-fold cross validation to au- 
tomatically determine the best k, we have obtained 
improved disambiguation accuracy on a large sense- 
tagged corpus. The accuracy achieved by our im- 
proved exemplar-based classifier is comparable to 
the accuracy on the same data set obtained by the 
Naive-Bayes algorithm, which was recently reported 
to have the highest disambiguation accuracy among 
seven state-of-the-art machine learning algorithms. 